ABSTRACT

We present the first annotated corpus of nonverbal behaviors 
in receptionist interactions, and the first nonverbal corpus 
(excluding the original video and audio data) of service en- 
counters freely available online. Native speakers of American 
English and Arabic participated in a naturalistic role play 
at reception desks of university buildings in Doha, Qatar 
and Pittsburgh, USA. Their manually annotated nonverbal 
behaviors include gaze direction, hand and head gestures, 
torso positions, and facial expressions. We discuss possible 
uses of the corpus and envision it becoming a useful tool
for the human-robot interaction community. 

1. INTRODUCTION 

Behavioral realism has been one of the promising direc- 
tions in the development of on-screen conversational agents 
and robots capable of natural language dialogue (see [19]
for an overview). For example, interactions with a robot 
receptionist that evoke a user's social response are associated
with better engagement and a lower rate of breakdowns during
information-seeking dialogues [21]. A necessary step in
designing such interactions is to identify behaviors with a 
potential to evoke a desired user response. 

Data sources that can be used to harvest behavior candi- 
dates include ethnographic and controlled studies. Ethno- 
graphic studies provide an opportunity for collection of nat- 
uralistic conversational data, but often face the issues of un- 
clear sample population and coarse granularity of captured 
data pj. On the other hand, collecting high resolution data 
in a controlled setting may hamper spontaneity and natural- 
ness of the interaction. In general, data collection methodol- 
ogy can influence both the sociopragmatic choices, namely, 
what speech act to say, and their pragmalinguistic realiza- 
tion, namely, how to say it (see [^ for a discussion). 

These methodological difficulties, combined with the chal- 
lenges of annotating multimodal data, result in the lack of 
annotated corpora of naturalistic interactions for many sce- 
narios that are currently relevant for human-robot interac- 
tion research. The corpus of role plays between a visitor and 
a receptionist in a realistic environment that we present in 
this paper attempts to help fill this gap. 

In the next section, we describe related work on corpora 
of service encounters. After that, we introduce our data col- 
lection methodology and the annotation scheme we use. We 
conclude with a discussion of possible uses of the corpus.



2. CORPORA OF SERVICE ENCOUNTERS 

Audio corpora of human service encounters have been 
used for analysis of linguistic and paralinguistic features, 
such as timing and prosody. For example, the Vienna-Oxford
International Corpus of English (VOICE) [20] includes service
encounters between speakers of English as a lingua franca. 
Audio recordings of Syrian shopping interactions were col- 
lected and analyzed by Traverso [22]. Service encounters 
gathered in public offices and shops of Catalonia were ex- 
amined with respect to how bilinguals negotiate code (lan- 
guage) of their interaction. Audio recordings have been used 
to analyze politeness strategies in shopping interactions (see, 
for example, [12]).



The importance of gaze (see [15] for an overview) and
smile (see, for example, [10]) in defining the outcome of
service interactions suggests the need for capturing and
studying nonverbal behaviors in videos. For instance, cus- 
tomers reported higher satisfaction when they interacted 
face-to-face with a bank teller who responded with a contingent
smile, rather than a constant neutral or constant smiling
expression [10]. The same data showed that amused and
polite smiles differ with respect to their temporal properties
[9]. Analysis of verbal and nonverbal expressions in
the videos of interethnic encounters of Korean retailers with 
Korean and African-American customers showed that these
language communities had different perceptions of the function
of socially minimal and socially expanded encounters [s].

Receptionist interactions, a subtype of service encounters, 
were analyzed with respect to their verbal content via role 
plays in [5]. Hewitt et al. [s] conducted discourse analysis 
of dialogues involving hospital receptionists. The openly ac- 
cessible CUBE-G corpus of nonverbal behaviors from role 
plays of German and Japanese participants covers scenarios 
that may be relevant for service encounters, including first 
meeting, negotiation and status difference [17]. The original
Map Task [1] and follow-up projects collect direction-giving
dialogues that may be relevant to some receptionist encoun- 
ters. 

We were not able to find any nonverbal corpora of hu- 
man receptionist interactions. With respect to availabil- 
ity, among all the corpora mentioned above only VOICE, 
CUBE-G and Map Task related corpora are freely accessi- 
ble. Hence, our corpus may be the first annotated corpus 
of nonverbal behaviors in receptionist interactions, and the 
first nonverbal corpus (excluding the original video and audio
data) of service encounters freely available online [13].



3. DATA COLLECTION 
3.1 Participants 

We recruited participants via emails and posters in Education
City in Doha, Qatar, and via announcements posted on bulletin
boards across the CMU campus in Pittsburgh, USA. The
recruitment materials specified that we were looking for native
speakers of American English or Arabic. The majority of the
participants (17 of 22) were university students, staff, or
faculty. The participants filled out demographic surveys and
evaluated themselves on the ten-item personality inventory
(TIPI) [t] and the 20-item positive and negative affect scale
(PANAS) [24]. The distribution of participants is shown in Table 1.



Site         Native language     Females   Males

Doha         Arabic                 2        6
Doha         American English       2        3
Pittsburgh   Arabic                 1        1
Pittsburgh   American English       5        1



Table 1: Distribution of participants between Doha 
and Pittsburgh experiment sites 

People apply different criteria when they report their native
language and mother tongue. To control for this,
we asked the participants to list the countries they lived in 
for more than a year, and their age at the time of moving 
in and out of the country. All but 3 participants (who were 
all in the American English condition in Doha) spent the 
majority of their lives in the country where their native lan- 
guage is a primary spoken language. A female participant in 
Doha changed her reported native language from American 
English to Tulu after asking the experimenter a clarification
question. Her data remain in the corpus, although she
is not included in Table 1.
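
As a quick consistency check on Table 1, the counts can be tallied along each dimension. The sketch below simply transcribes the eight cells of the table; the dictionary layout and the helper name `marginal` are illustrative assumptions, not part of the corpus distribution.

```python
# Counts transcribed from Table 1: (site, native language, gender) -> participants.
# The key layout is an illustrative choice, not the corpus's own format.
TABLE_1 = {
    ("Doha", "Arabic", "F"): 2,
    ("Doha", "Arabic", "M"): 6,
    ("Doha", "American English", "F"): 2,
    ("Doha", "American English", "M"): 3,
    ("Pittsburgh", "Arabic", "F"): 1,
    ("Pittsburgh", "Arabic", "M"): 1,
    ("Pittsburgh", "American English", "F"): 5,
    ("Pittsburgh", "American English", "M"): 1,
}


def marginal(table, axis):
    """Sum the counts grouped by the key component at position `axis`."""
    totals = {}
    for key, count in table.items():
        totals[key[axis]] = totals.get(key[axis], 0) + count
    return totals


by_site = marginal(TABLE_1, 0)      # totals per experiment site
by_language = marginal(TABLE_1, 1)  # totals per reported native language
by_gender = marginal(TABLE_1, 2)    # totals per gender
total_in_table = sum(TABLE_1.values())

print(by_site, by_language, by_gender, total_in_table)
```

The table accounts for 21 participants; together with the participant who reported Tulu as her native language, this matches the 22 participants mentioned above.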

The mean age of participants in Doha was 25 years (SD =
7.8). In Pittsburgh, the average age was 28.7 years (SD = 12.7).
Native speakers of Arabic were on average 23.2 years old
(SD = 4.2), while the average age of native speakers of American
English was 30.9 years (SD = 12.5).

3.2 Procedure 

After filling out the questionnaires, one of the participants 
was asked to play the role of a receptionist while another was 
asked to imagine themselves as a first-time visitor looking for 
a particular location inside the building. The location was 
picked by the experimenter from the following list: library, 
restroom, cafeteria, student recreation room, a professor's
office, etc. Visitors were asked to seek the receptionist's help
with directions, using English, and then to proceed towards
their destination.

Most of the participant pairs were not familiar with each 
other. The fact of familiarity, when clear, is noted in the 
annotations. Similarly, the annotations include information 
on whether the participant has a thorough (works or studies 
inside the building) or passing (works or studies in a nearby 
building) familiarity with the experiment site. 

At both sites, the receptionist would occupy the actual
receptionist area in the lobby of the building. In Doha,
on-duty security guards were present in the vicinity of the
reception desk.

Each pair of participants would have 2-3 interactions with 
one of the subjects as a receptionist, and then they would 
switch roles and have 2 or 3 more interactions, depending on 
allotted time. After that, the participants were debriefed on 
their experiences. Overall, more than 60 interactions were 
recorded. 

The interactions were recorded with 2 or 3 consumer-level 
high definition cameras. The visitor and the receptionist each
had a dedicated camera capturing their torso, arms and face,
positioned about 45 degrees off their default line of sight
(namely, the line of sight that is perpendicular to the front 
edge of the rectangular reception desk). Most of the inter- 
actions would have a third camera capturing the side view 
of the scene. All cameras were in plain view. In addition to 
the audio captured by the cameras, an audio recorder (iPod) 
was placed on the receptionist desk. 

4. ANNOTATION SCHEME 

The main goal of our corpus is to analyze occurrences and 
timing of verbal and nonverbal behaviors. Consequently, we 
have chosen to annotate the data at a level of granularity
that minimizes the coding effort while still allowing us to
capture the timing and major features of communicative
events. For example, instead of annotating each of the
preparation, hold, stroke, and retraction phases of a hand
gesture [11], we annotate the interval between the beginnings of
the stroke and retraction phases. Similarly, facial expressions
are annotated as intervals approximately from the beginning 
of rise to the beginning of decay M phases, with some er- 
ror inherent to manual annotation. The annotation scheme, 
developed in the process of annotating the corpus, is
summarized in Table 2.



Modality    Values

Speech      Transcribed utterances, including non-words
Eye gaze    Pointing (self-initiated), pointing (following
            interlocutor), focus (interlocutor, guard, desktop,
            down, up, left, right, front, back, scattered,
            destination)
Face        Smile (open or closed mouth)
Head        Nod, half nod, double nod, multiple nod, upward nod,
            multiple upward nod, micro nod, shake
Hand        Pointing (left or right hand), finger only
Torso       Sitting, standing, focus (left, right, front, back,
            destination, interlocutor, desk)



Table 2: Annotation scheme 
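
One way to work with annotations of this kind programmatically is to treat each one as a labeled time interval on a modality track. The following is a minimal sketch under that assumption; the class name `Annotation`, its fields, and the example timestamps are hypothetical and do not describe the corpus's actual file format or the Advene data model.

```python
from dataclasses import dataclass

# Modality names follow Table 2. Only the head vocabulary is spelled out here,
# as an example of validating values against the scheme.
MODALITIES = {"speech", "eye gaze", "face", "head", "hand", "torso"}
HEAD_VALUES = {
    "nod", "half nod", "double nod", "multiple nod",
    "upward nod", "multiple upward nod", "micro nod", "shake",
}


@dataclass(frozen=True)
class Annotation:
    """A single labeled interval on one modality track (hypothetical layout)."""
    participant: str  # subject label, e.g. "S4"
    modality: str     # one of MODALITIES
    value: str        # label from the scheme, or transcribed text for speech
    start: float      # interval start, in seconds
    end: float        # interval end, in seconds

    def __post_init__(self):
        if self.modality not in MODALITIES:
            raise ValueError(f"unknown modality: {self.modality!r}")
        if self.end < self.start:
            raise ValueError("interval ends before it starts")
        if self.modality == "head" and self.value not in HEAD_VALUES:
            raise ValueError(f"unknown head value: {self.value!r}")

    @property
    def duration(self) -> float:
        return self.end - self.start


# Made-up example: a nod by S4 lasting 0.7 seconds.
nod = Annotation("S4", "head", "nod", start=3.2, end=3.9)
print(round(nod.duration, 2))
```

Keeping every modality in the same interval representation makes it straightforward to compare timing across tracks, which is the main purpose stated for the scheme.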

Coding nonverbal expressions, as well as transcribing
ambiguous speech, involves a degree of subjectivity. For example,
the exact point of gaze fixation within the recipient's
face is hard to identify even for the recipient [23].
In fact, a typical direct eye contact consists of a sequence
of fixations on different points on the face [6]. Since it is
unclear whether the exact fixation pattern has any influence
on social communication, in this study we do not distinguish
between different fixation points within the general face area
(nor does the video fidelity allow that). We plan to
validate the annotations by employing a second annotator.

The annotation is done using the multi-track video anno- 
tation tool Advene [2]. 

5. DISCUSSION 

While the small number of individual participants makes
this corpus unsuitable for cross-subject analysis, the multiple
trials may be accounted for by mixed-effects models [16].
More appropriately, the corpus should be used for qualitative
analysis and the formation of hypotheses for further studies.
For example, compare the gaze behaviors of a native
Arabic-speaking female S4 (Subject 4) playing a receptionist
responding to native Arabic-speaking male S1 playing a
visitor (Fig. 1) versus the dialogue with the subjects' roles
reversed (Fig. 2). Notice that both subjects gazed at their
interlocutor more in the visitor role. This appears to be 
a trend that can be explained in part by the receptionist 
looking towards the destination during the direction-giving 
speech, while the visitor may continue looking at the recep- 
tionist. 
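
The figures report gaze as a fraction of the entire interaction, of the receptionist's speech, and of the visitor's speech. The paper does not spell out the computation; a plausible reading, sketched below, is the duration of overlap between gaze intervals and a set of reference intervals, divided by the total reference duration. The function names and the example intervals are assumptions for illustration only, and intervals within each list are assumed not to overlap one another.

```python
def total_duration(intervals):
    """Total length of a list of (start, end) intervals, in seconds."""
    return sum(end - start for start, end in intervals)


def overlap_duration(a, b):
    """Total time during which some interval in `a` and some interval in `b` co-occur.

    Assumes intervals within `a` do not overlap each other, and likewise for `b`.
    """
    return sum(
        max(0.0, min(a_end, b_end) - max(a_start, b_start))
        for a_start, a_end in a
        for b_start, b_end in b
    )


def gaze_fraction(gaze, reference):
    """Fraction of the reference time during which the gaze was active."""
    reference_time = total_duration(reference)
    if reference_time == 0:
        return 0.0
    return overlap_duration(gaze, reference) / reference_time


# Made-up intervals (seconds) for one participant.
gaze_at_interlocutor = [(0.0, 2.0), (5.0, 6.0)]
receptionist_speech = [(1.0, 3.0), (5.5, 7.5)]
entire_interaction = [(0.0, 10.0)]

print(gaze_fraction(gaze_at_interlocutor, entire_interaction))   # fraction of whole dialogue
print(gaze_fraction(gaze_at_interlocutor, receptionist_speech))  # fraction of receptionist's speech
```

Computing the same fraction against different reference intervals is what allows the comparison between roles: a receptionist who looks towards the destination while giving directions will show a low fraction of gaze at the interlocutor during their own speech.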

Now, compare the receptionist gaze of S4 (Fig. 1) with that
of S12 (Fig. 3), who is a female native speaker of American
English. Notice the short glances that punctuate fragments
of the directions sequence spoken by S12. These glances
appear to precede the visitor's backchannels and therefore may
play a role in connection events [Ts]. Receptionist S4, on the 
contrary, did not glance at the visitor until the very end of 
the directions sequence. These different gaze behaviors may 
reflect individual styles, genders and cultures of receptionist- 
visitor pairs, or levels of comfort and expertise, among other 
possibilities. Further, more controlled studies may address
these hypotheses.
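
The observation that receptionist glances tend to precede visitor backchannels could be checked on the annotated intervals. The sketch below is one hypothetical operationalization: a backchannel counts as preceded by a glance if some glance starts within a fixed window before the backchannel onset. The one-second window, the function name, and the example timings are all assumptions, not values taken from the corpus.

```python
def preceded_by_glance(backchannel_onsets, glance_intervals, window=1.0):
    """For each backchannel onset, report whether a glance started shortly before it.

    `window` (seconds) is an assumed threshold, not one defined by the corpus.
    """
    flags = []
    for onset in backchannel_onsets:
        hit = any(
            onset - window <= glance_start <= onset
            for glance_start, _glance_end in glance_intervals
        )
        flags.append(hit)
    return flags


# Made-up timings (seconds): receptionist glances and visitor backchannel onsets.
glances = [(2.0, 2.3), (6.1, 6.4)]
backchannels = [2.6, 4.8, 6.5]

flags = preceded_by_glance(backchannels, glances)
proportion = sum(flags) / len(flags)
print(flags, proportion)
```

A high proportion in one receptionist's dialogues and a low proportion in another's would be consistent with the contrast described above, although a controlled study would be needed to attribute it to any particular factor.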

Visitor (S1) utterances, in order:
    (unclear) morning
    eh, do you know where is d.. professor Majd Sakr office?
    majd sakr
    uhu
    uhu
    ok thank you

Receptionist (S4) utterances, in order:
    hi, good morning
    eh, professor who?
    eh, majd sakr, professor
    so, you may go this way in this corridor
    it's the c s corridor
    and then (i or o unclear)n your right...
    you have all offices you can read his name on the pallet
    on the office

                                    Gaze towards pointed direction   Gaze towards each other
                                    Visitor      Receptionist        Visitor      Receptionist
Fraction of entire interaction:     0.03         0.36                0.84         0.28
Fraction of receptionist's speech:  0.07         0.64                0.93         0.20
Fraction of visitor's speech:       0.00         0.09                0.60         0.68

Figure 1: Interaction between S1 as a visitor and S4 as a receptionist. Wide vertical stripes represent intervals
of speech. Narrow vertical stripes represent (from left to right): intervals of visitor's and receptionist's gaze
towards the direction pointed by the receptionist, and visitor's and receptionist's gaze towards each other.
Color coding of these modalities is specified by the icons in the upper part of the plots.

Visitor (S4) utterances, in order:
    um, excuse me, i am (unclear) looking for cmu library, can you lead me to this library please?
    ok
    ok. so that (unclear) over there?
    ok, thank you so much

Receptionist (S1) utterances, in order:
    yeah, sure
    the library is there., just you have to .. walk that way
    uhu
    yeah... there

                                    Gaze towards pointed direction   Gaze towards each other
                                    Visitor      Receptionist        Visitor      Receptionist
Fraction of entire interaction:     0.31         0.40                0.61         0.57
Fraction of receptionist's speech:  0.52         0.83                0.48         0.17
Fraction of visitor's speech:       0.18         0.18                0.70         0.76

Figure 2: Interaction between S4 as a visitor and S1 as a receptionist. Wide vertical stripes represent intervals
of speech. Narrow vertical stripes represent (from left to right): intervals of visitor's and receptionist's gaze
towards the direction pointed by the receptionist, and visitor's and receptionist's gaze towards each other.
Color coding of these modalities is specified by the icons in the upper part of the plots.

Visitor (S11) utterances, in order:
    hey
    (unclear: hey)
    can i please know where is the library?
    yeah
    okay
    it's on the ground floor you mean?
    okay
    okay thank you

Receptionist (S12) utterances, in order:
    hi
    how can i help you
    a library
    um if you come down this hallway
    all the way to the end..
    and and you take a right
    it's on the ground floor
    it'll be right there you'll er um see these glass doors
    and it's the library right there
    so just
    go down the hallway
    and take a right and straight
    (rise) mm-hm

                                    Gaze towards pointed direction   Gaze towards each other
                                    Visitor      Receptionist        Visitor      Receptionist
Fraction of entire interaction:     0.25         0.33                0.64         0.60
Fraction of receptionist's speech:  0.33         0.42                0.63         0.47
Fraction of visitor's speech:       0.17         0.16                0.72         0.84

Figure 3: Interaction between S11 as a visitor and S12 as a receptionist. The visitor's eye gaze for this
particular dialogue is partially inferred from his head gaze. Wide vertical stripes represent intervals of
speech. Narrow vertical stripes represent (from left to right): intervals of visitor's and receptionist's gaze
towards the direction pointed by the receptionist, and visitor's and receptionist's gaze towards each other.
Color coding of these modalities is specified by the icons in the upper part of the plots.



